17 research outputs found

    Evaluation of terminologies acquired from comparable corpora : an application perspective

    This paper describes a protocol for evaluating bilingual terminologies acquired from comparable corpora. The protocol aims to assess the added value of these terminologies in a specialized translation task. It consists of having specialized texts translated in several situations: without any specialized resource, with a domain-related bilingual terminology, or using the Internet. By comparing the quality of the segments translated with these different resources, we can assess the impact of our bilingual terminologies on translation quality.

    Identification of Fertile Translations in Medical Comparable Corpora: a Morpho-Compositional Approach

    This paper defines a method for lexicon extraction in the biomedical domain from comparable corpora. The method is based on compositional translation and exploits morpheme-level translation equivalences. It can generate translations for a large variety of morphologically constructed words and can also generate 'fertile' translations. We show that fertile translations increase the overall quality of the extracted lexicon for English-to-French translation.

    Identification de compatibilités entre tags descripteurs de lieux et apprentissage automatique

    The work presented in this article belongs to the line of research that aims to acquire semantic relations from folksonomies (sets of tags assigned to resources by users). We experiment with several state-of-the-art approaches, as well as the contribution of machine learning, for identifying relations between tags. In the best case we obtain an error rate of 23.7% (unrecognized or incorrect relations), which is encouraging given the difficulty of the task (human annotators have a disagreement rate of 12%).

    Traduction assistée par ordinateur et corpus comparables : contributions à la traduction compositionnelle

    Funding: ANR project Metricc (grant ANR-08-CORD-013), ANRT (CIFRE no. 2010/270), and the company Lingua et Machina.
    Our work deals with the extraction of bilingual lexicons from comparable corpora, with an application to specialized translation. We started by evaluating, in a user-oriented fashion, classical methods based on the distributional hypothesis (the more two terms appear in similar contexts, the more likely they are to be translations of each other). This evaluation showed that translators are very uncomfortable with this kind of lexicon: they find correct translations hard to spot in the lists of candidate translations and would rather use a smaller lexicon with higher precision. Based on this observation, we turned to another approach to term translation, which has recently been applied successfully to comparable corpora and produces lexicons that meet translators' demands: compositional translation. In this framework, the translation of a term is composed of the translations of its parts. We concentrated on the translation of monolexical terms: the source term is decomposed into morphemes, and the morphemes are translated into the target language and recomposed into a target term. We investigated three lines of research: the generation of fertile translations (cases in which the target term has more lexical words than the source term), independence from morphological structure, and the ranking of candidate translations.

    Investigating the Structure of Procedural Texts for Answering How-to Questions, LREC 2008

    This paper presents ongoing work on parsing the textual structure of procedural texts. We propose a model of the instructional structure and criteria for identifying its main components: titles, instructions, warnings and prerequisites. The main aim of this project, besides contributing to text processing, is to be able to answer procedural questions (How-to? questions), where the answer is a well-formed portion of a text, not a small set of words as for factoid questions.
    1. Situation and Aims
    The main goal of this work is to answer procedural questions, that is, questions whose expected response is typically a fragment, of varying size, of a procedure, i.e., a set of coherent instructions designed to reach a goal. Recent informal observations of queries to Web search engines show that procedural questions are the second largest set of queries after factoid questions (de Rijke, 2005). Answering procedural questions thus requires extracting not simply a word from a text fragment, as for factoid questions, but a well-formed text structure, which may be quite large. Analysing a procedural text requires a dedicated discourse analysis, e.g. by means of a grammar. Such grammars are not yet very common because of the complex intertwining of lexical, syntactic, semantic and pragmatic factors that a correct analysis requires. Discourse grammars basically have a top-down organization: they take discourse acts, rather than just words, as their basic units; they account for the structure of these acts and the interactions between them; and they require a relatively elaborate conceptual representation as output. Such a grammar must capture discourse cohesion, possibly the communicative intentions, and the discourse organization, e.g. in terms of plans. Procedural texts are organized sets of instructions; they may also be sets of advice, as in social behavior texts. In our perspective, procedural texts range from apparently simple cooking recipes to large maintenance manuals. They also include documents as diverse as teaching texts, medical notices, social behavior recommendations, directions for use, assembly notices, do-it-yourself notices, itinerary guides, advice texts, savoir-faire guide

    The Airbus Air Traffic Control speech recognition 2018 challenge: towards ATC automatic transcription and call sign detection

    In this paper, we describe the outcomes of the challenge on Air Traffic Control (ATC) speech recognition organized and run by Airbus and partners in 2018. The challenge consisted of two tasks applied to English ATC speech: 1) automatic speech-to-text transcription and 2) call sign detection (CSD). Registered participants were provided with 40 hours of speech along with manual transcriptions. Twenty-two teams submitted predictions on a five-hour evaluation set. ATC speech processing is challenging for several reasons: high speech rate, foreign-accented speech with a great diversity of accents, and noisy communication channels. The best-ranked team achieved a 7.62% Word Error Rate and an 82.41% CSD F1-score. Transcribing pilots' speech was found to be twice as hard as transcribing controllers' speech. Remaining issues on the way to solving ATC ASR are also discussed in the paper.

    Identification de compatibilités entre descripteurs de lieux et apprentissage automatique

    The work presented in this article belongs to the line of research that aims to acquire semantic relations from folksonomies (sets of tags assigned to resources by users). We experiment with several state-of-the-art approaches, as well as the contribution of machine learning, for identifying relations between tags. In the best case we obtain an error rate of 23.7% (unrecognized or incorrect relations), which is encouraging given the difficulty of the task (human annotators have a disagreement rate of 12%).